278 research outputs found

    Individual Privacy vs Population Privacy: Learning to Attack Anonymization

    Over the last decade, great strides have been made in developing techniques to compute functions privately. In particular, Differential Privacy gives strong promises about the conclusions that can be drawn about an individual. In contrast, various syntactic methods for providing privacy (criteria such as k-anonymity and l-diversity) have been criticized for still allowing private information about an individual to be inferred. In this report, we consider the ability of an attacker to use data meeting privacy definitions to build an accurate classifier. We demonstrate that even under Differential Privacy, such classifiers can be used to accurately infer "private" attributes in realistic data. We compare this to similar approaches for inference-based attacks on other forms of anonymized data. We place these attacks on the same scale, and observe that the accuracy of inference of private attributes for Differentially Private data and l-diverse data can be quite similar.

    Tight Lower Bound for Comparison-Based Quantile Summaries

    Quantiles, such as the median or percentiles, provide concise and useful information about the distribution of a collection of items drawn from a totally ordered universe. We study data structures, called quantile summaries, which keep track of all quantiles up to an error of at most $\varepsilon$. That is, an $\varepsilon$-approximate quantile summary first processes a stream of items and then, given any quantile query $0 \le \phi \le 1$, returns an item from the stream that is a $\phi'$-quantile for some $\phi' = \phi \pm \varepsilon$. We focus on comparison-based quantile summaries that can only compare two items and are otherwise completely oblivious of the universe. The best such deterministic quantile summary to date, due to Greenwald and Khanna (SIGMOD '01), stores at most $O(\frac{1}{\varepsilon}\cdot \log \varepsilon N)$ items, where $N$ is the number of items in the stream. We prove that this space bound is optimal by showing a matching lower bound. Our result thus rules out the possibility of constructing a deterministic comparison-based quantile summary in space $f(\varepsilon)\cdot o(\log N)$, for any function $f$ that does not depend on $N$. As a corollary, we improve the lower bound for biased quantiles, which provide a stronger, relative-error guarantee of $(1\pm \varepsilon)\cdot \phi$, and for other related computational tasks. Comment: 20 pages, 2 figures, major revision of the construction (Sec. 3) and some other parts of the paper.
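
The approximation guarantee in the abstract above can be illustrated with a small check. This is only a sketch of the definition, not the Greenwald-Khanna summary or the paper's construction. It assumes (this convention is not stated in the abstract) that the item of rank r in a stream of N items counts as an r/N-quantile; the function name `is_eps_approx_quantile` is hypothetical.

```python
def is_eps_approx_quantile(stream, item, phi, eps):
    """Check whether `item` is a valid answer to quantile query `phi`
    with error at most `eps`, i.e. a phi'-quantile for some
    phi' in [phi - eps, phi + eps].

    Assumption (ours, for illustration): the item of rank r
    (1-indexed, in sorted order) is an r/N-quantile of the stream.
    Only comparisons between items are used.
    """
    n = len(stream)
    # With duplicates, `item` occupies every rank from lo to hi.
    lo = sum(1 for x in stream if x < item) + 1
    hi = sum(1 for x in stream if x <= item)
    if hi < lo:
        return False  # the item does not occur in the stream
    # Range of acceptable ranks; the upper end is clamped to 1 so that
    # the minimum is always an acceptable answer for phi = 0.
    target_lo = (phi - eps) * n
    target_hi = max(1.0, (phi + eps) * n)
    # Valid if the two rank intervals overlap.
    return lo <= target_hi and hi >= target_lo


stream = list(range(1, 101))  # 100 distinct items
print(is_eps_approx_quantile(stream, 53, 0.5, 0.05))  # rank 53, inside the window
print(is_eps_approx_quantile(stream, 60, 0.5, 0.05))  # rank 60, outside the window
```

A checker like this stores the whole stream, so it is useful only for validating answers; the point of a quantile summary is to give the same guarantee while storing far fewer items.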

    Engineering Streaming Algorithms

    Streaming algorithms must process a large quantity of small updates quickly to allow queries about the input to be answered from a small summary. Initial work on streaming algorithms laid out theoretical results, and subsequent efforts have involved engineering these for practical use. Informed by experiments, streaming algorithms have been widely implemented and used in practice. This talk will survey this line of work, and identify some lessons learned.

    First Author Advantage: Citation Labeling in Research

    Citations among research papers, and the networks they form, are the primary object of study in scientometrics. The act of making a citation reflects the citer's knowledge of the related literature, and of the work being cited. We aim to gain insight into this process by studying citation keys: user-chosen labels to identify a cited work. Our main observation is that the first listed author is disproportionately represented in such labels, implying a strong mental bias towards the first author. Comment: Computational Scientometrics: Theory and Applications at The 22nd CIKM 201

    Scienceography: the study of how science is written

    Scientific literature has itself been the subject of much scientific study, for a variety of reasons: understanding how results are communicated, how ideas spread, and assessing the influence of areas or individuals. However, most prior work has focused on extracting and analyzing citation and stylistic patterns. In this work, we introduce the notion of 'scienceography', which focuses on the writing of science. We provide a first large-scale study using data derived from the arXiv e-print repository. Crucially, our data includes the "source code" of scientific papers (the LaTeX source), which enables us to study features not present in the "final product", such as the tools used and private comments between authors. Our study identifies broad patterns and trends in two example areas (computer science and mathematics), as well as highlighting key differences in the way that science is written in these fields. Finally, we outline future directions to extend the new topic of scienceography. Comment: 13 pages, 16 figures. Sixth International Conference on FUN WITH ALGORITHMS, 201